Abstract

We present a method to estimate block membership of nodes in a ran- 
dom graph generated by a stochastic blockmodel. We use an embedding 
procedure motivated by the random dot product graph model, a partic- 
ular example of the latent position model. The embedding associates 
each node with a vector; these vectors are clustered via minimization 
of a square error criterion. We prove that this method is consistent for 
assigning nodes to blocks, as only a negligible number of nodes will be 
mis-assigned. We prove consistency of the method for directed and undi- 
rected graphs. The consistent block assignment makes possible consistent 
parameter estimation for a stochastic blockmodel. We extend the result
to the setting where the number of blocks grows slowly with the number
of nodes. Our method is also computationally feasible even for very
large graphs. We compare our method to Laplacian spectral clustering 
through analysis of simulated data and a graph derived from Wikipedia 
documents. 



1 Background and Overview 

Network analysis is rapidly becoming a key tool in the analysis of modern 
datasets in fields ranging from neuroscience to sociology to biochemistry. In 
each of these fields, there are objects, such as neurons, people, or genes, and 
there are relationships between objects, such as synapses, friendships, or pro- 
tein interactions. The formation of these relationships can depend on attributes 
of the individual objects as well as higher order properties of the network as 
a whole. Objects with similar attributes can form communities with similar 
connective structure, while unique properties of individuals can fine tune the 
shape of these relationships. Graphs encode the relationships between objects 
as edges between nodes in the graph. 

Clustering objects based on a graph enables identification of communities 
and objects of interest as well as illumination of overall network structure.
Finding optimal clusters is difficult and will depend on the particular setting
and task. Even in moderately sized graphs, the number of possible partitions of
nodes is enormous, so a tractable search strategy is necessary. Methods for 
finding clusters of nodes in graphs are many and varied, with origins in physics, 
engineering, and statistics; Fortunato (2010) and Fjällström (1998) provide
comprehensive reviews of clustering techniques. In addition to techniques motivated
by heuristics based on graph structure, others have attempted to fit statistical
models with inherent community structure to a graph (Airoldi et al., 2008;
Handcock et al., 2007; Nowicki and Snijders, 2001; Snijders and Nowicki, 1997).

These statistical models use random graphs to model relationships between 
objects; Goldenberg et al. (2010) provide a review of statistical models for
networks. A graph consists of a set of nodes, representing the objects, and a set 
of edges, representing relationships between the objects. The edges can be either 
directed (ordered pairs of nodes) or undirected (unordered pairs of nodes). In 
our setting, the node set is fixed and the set of edges is random. 

Hoff et al. (2002) proposed what they call a latent space model for random 
graphs. Under this model each node is associated with a latent random vector. 
There may also be additional covariate information which we do not consider 
in this work. The vectors are independent and identically distributed and the 
probability of an edge between two nodes depends only on their latent vectors. 
Conditioned on the latent vectors, the presence of each edge is an independent 
Bernoulli trial. 

One example of a latent space model is the random dot product graph 
(RDPG) model (Young and Scheinerman, 2007). Under the RDPG model, the 
probability an edge between two nodes is present is given by the dot product 
of their respective latent vectors. For example, in a social network with edges 
indicating friendships, the components of the vector may be interpreted as the 
relative interest of the individual in various topics. The magnitude of the vec- 
tor can be interpreted as how talkative the individual is, with more talkative 
individuals more likely to form relationships. Talkative individuals interested in 
the same topics are most likely to form relationships while individuals who do 
not share interests are unlikely to form relationships. 

We present an embedding motivated by the RDPG model which uses a 
decomposition of a low rank approximation of the adjacency matrix. The de- 
composition gives an embedding of the nodes as vectors in a low dimensional 
space. This embedding is similar to embeddings used in spectral clustering but 
operates directly on the adjacency matrix rather than a Laplacian. We discuss 
a relationship between spectral clustering and our work in Section 7. 

Our results are for graphs generated by a stochastic blockmodel (Holland 
et al., 1983; Wang and Wong, 1987). In this model, each node is assigned to a 
block, and the probability of an edge between two nodes depends only on their 
respective block memberships; in this manner two nodes in the same block are 
stochastically equivalent. In the context of the latent space model, all nodes in 
the same block are assigned the same latent vector. An advantage of this model 
is the clear and simple block structure, where block membership is determined 
solely by the latent vector. 

Algorithm 1 The adjacency spectral clustering procedure for directed graphs.
Input: $A \in \{0,1\}^{n \times n}$
Parameters: $d \in \{1, 2, \ldots, n\}$, $K \in \{2, 3, \ldots, n\}$
Step 1: Compute the singular value decomposition, $A = \tilde{U}' \tilde{\Sigma}' \tilde{V}'^T$. Let $\tilde{\Sigma}'$
have decreasing main diagonal.
Step 2: Let $\tilde{U}$ and $\tilde{V}$ be the first $d$ columns of $\tilde{U}'$ and $\tilde{V}'$, respectively, and
let $\tilde{\Sigma}$ be the submatrix of $\tilde{\Sigma}'$ given by the first $d$ rows and columns.
Step 3: Define $\tilde{Z} = [\tilde{U}\tilde{\Sigma}^{1/2} \,|\, \tilde{V}\tilde{\Sigma}^{1/2}] \in \mathbb{R}^{n \times 2d}$ to be the concatenation of the
coordinate-scaled singular vector matrices.
Step 4: Let $(\hat{\psi}, \hat{\tau}) = \operatorname{argmin}_{\psi; \tau} \sum_{u=1}^{n} \|\tilde{Z}_u - \psi_{\tau(u)}\|_2^2$ give the centroids and
block assignments, where $\tilde{Z}_u$ is the $u$th row of $\tilde{Z}$, $\psi \in \mathbb{R}^{K \times 2d}$ are the centroids
and $\tau$ is a function from $[n]$ to $[K]$.
return $\hat{\tau}$, the block assignment function.

Given a graph generated from a stochastic blockmodel, our primary goal is
to accurately assign all of the nodes to their correct blocks. Algorithm 1 gives
the main steps of our procedure. In summary these steps involve computing 
the singular value decomposition of the adjacency matrix, reducing the dimen- 
sion, coordinate-scaling the singular vectors by the square root of their singular 
value and, finally, clustering via minimization of a square error criterion. We 
note that Step 4 in the procedure is a mathematically convenient stand in for 
what might be used in practice. Indeed, the standard K-means algorithm
approximately minimizes the square error and we use K-means for evaluating the
procedure empirically. This paper shows that the node assignments returned 
by Algorithm 1 are consistent. 
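
The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name, the deterministic farthest-point initialization, and the fixed number of Lloyd iterations are all assumptions made here, and the Lloyd iterations only approximate the exact minimizer required by Step 4.

```python
import numpy as np


def adjacency_spectral_clustering(A, d, K, n_iter=20):
    """Sketch of Algorithm 1; Step 4 is approximated by Lloyd's K-means."""
    A = np.asarray(A, dtype=float)
    # Step 1: SVD; NumPy returns the singular values in decreasing order.
    U_full, s_full, Vt_full = np.linalg.svd(A)
    # Step 2: keep the leading d singular vectors and singular values.
    U = U_full[:, :d]
    V = Vt_full[:d, :].T
    root = np.sqrt(s_full[:d])
    # Step 3: coordinate-scale and concatenate, giving an n x 2d matrix.
    Z = np.hstack([U * root, V * root])
    # Initialization (an assumption of this sketch): farthest-point traversal.
    centroids = [Z[0]]
    for _ in range(1, K):
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(Z[int(d2.argmax())])
    centroids = np.array(centroids)
    # Step 4 (approximate): alternate assignments and centroid updates.
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = Z[labels == k].mean(axis=0)
    return labels, Z
```

The returned labels are 0-based and are defined only up to a permutation of the block names.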

Consistency of node assignments means that the proportion of mis-assigned 
nodes goes to zero (probabilistically) as the number of nodes goes to infinity. 
Others have already shown similar consistency of node assignments. Snijders
and Nowicki (1997) provided an algorithm to consistently assign nodes to blocks 
under the stochastic blockmodel for two blocks, and later Condon and Karp 
(2001) provided a consistent method for equal sized blocks. Bickel and Chen 
(2009) showed that maximizing the Newman-Girvan modularity (Newman and 
Girvan, 2004) or the likelihood modularity provides consistent estimation of 
block membership. Choi et al. (In press) used likelihood methods to show
consistency with rapidly growing numbers of blocks. 

Maximizing modularities and likelihood methods are both computationally 
difficult, but provide theoretical results for rapidly growing numbers of blocks. 
Our method is related to that of McSherry (2001), in that we consider a low
rank approximation of the adjacency matrix, but their results do not provide
consistency of node assignments. Rohe et al. (2011) used spectral clustering to
show consistent estimation of block partitions with growing number of blocks; 
in this paper we demonstrate that for both directed and undirected graphs, our 
proposed embedding allows for accurate block assignment in a stochastic block- 
model. These matrix decomposition methods are computationally feasible, even 
for graphs with a large number of nodes.

The remainder of the paper is organized as follows. In Section 2 we formally
present the stochastic blockmodel, the random dot product graph model and
our adjacency spectral embedding. In Section 3 we state and prove our main
theorem, and in Section 4 we present some useful corollaries. Sections 2-4 focus
only on directed random graphs; in Section 5 we present the model and results for
undirected graphs. In Section 6 we present simulations and empirical analysis
to illustrate the performance of the algorithm. Finally, in Section 7 we discuss
further extensions to the theorem. In the appendix, we prove some key technical
results needed to prove our main theorem.

2 Model and Embedding 

First, we adopt the following conventions. For a matrix $M \in \mathbb{R}^{n \times m}$, entry $i,j$
is denoted by $M_{ij}$. Row $i$ is denoted $M_i^T \in \mathbb{R}^{1 \times m}$, where $M_i$ is a column vector.
Column $j$ is denoted as $M_{\cdot j}$ and occasionally we refer to row $i$ as $M_{i \cdot}$.

The node set is $[n] = \{1, 2, \ldots, n\}$. For directed graphs edges are ordered
pairs of elements in $[n]$. For a random graph, the node set is fixed and the edge
set is random. The edges are encoded in an adjacency matrix $A \in \{0,1\}^{n \times n}$.
For directed graphs, the entry $A_{uv}$ is 1 or 0 according as an edge from node $u$
to node $v$ is present or absent in the graph. We consider graphs with no loops,
meaning $A_{uu} = 0$ for all $u \in [n]$.

2.1 Stochastic Blockmodel 

Our results are for random graphs distributed according to a stochastic block- 
model (Holland et al., 1983; Wang and Wong, 1987), where each node is a mem- 
ber of exactly one block and the probability of an edge from node u to node v 
is determined by the block memberships of nodes $u$ and $v$ for all $u, v \in [n]$. The
model is parametrized by $P \in [0,1]^{K \times K}$ and $\rho \in (0,1)^K$ with $\sum_{i=1}^{K} \rho_i = 1$.
$K$ is the number of blocks, which are labeled $1, 2, \ldots, K$. The block memberships
of all nodes are determined by the random block membership function
$\tau : [n] \mapsto [K]$. For all nodes $u \in [n]$ and blocks $i \in [K]$, $\tau(u) = i$ would
mean node $u$ is a member of block $i$; node memberships are independent with
$\mathbb{P}[\tau(u) = i] = \rho_i$.

The entry $P_{ij}$ gives the probability of an edge from a node in block $i$ to
a node in block $j$ for each $i,j \in [K]$. Conditioned on $\tau$, the entries of $A$ are
independent, and $A_{uv}$ is a Bernoulli random variable with parameter $P_{\tau(u),\tau(v)}$
for all $u \neq v \in [n]$. This gives

$$\mathbb{P}[A \mid \tau] = \prod_{u \neq v} \mathbb{P}[A_{uv} \mid \tau(u), \tau(v)] = \prod_{u \neq v} \left(P_{\tau(u),\tau(v)}\right)^{A_{uv}} \left(1 - P_{\tau(u),\tau(v)}\right)^{1 - A_{uv}}, \qquad (1)$$

with the product over all ordered pairs of nodes.

The row $P_{i \cdot}$ and column $P_{\cdot i}$ determine the probabilities of the presence of
edges incident to a node in block $i$. In order that the blocks be distinguishable,
we require that different blocks have distinct probabilities so that either $P_{i \cdot} \neq P_{j \cdot}$
or $P_{\cdot i} \neq P_{\cdot j}$ for all $i \neq j \in [K]$.
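
As an illustration of the generative model in this section, a directed stochastic blockmodel graph can be sampled as follows. This is a sketch under stated assumptions: the function name and the list-of-lists representation are choices made here, and block labels are 0-based rather than the 1-based labels used in the text.

```python
import random


def sample_directed_sbm(n, P, rho, seed=None):
    """Draw block labels tau and a loop-free directed adjacency matrix A."""
    rng = random.Random(seed)
    K = len(rho)
    # Independent block memberships with P[tau(u) = i] = rho[i].
    tau = rng.choices(range(K), weights=rho, k=n)
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            # Each ordered pair u != v is an independent Bernoulli trial.
            if u != v and rng.random() < P[tau[u]][tau[v]]:
                A[u][v] = 1
    return tau, A
```

The diagonal is left at zero, matching the no-loops convention, and the entries $A_{uv}$ and $A_{vu}$ are drawn separately because the graph is directed.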

Theorem 1 shows that using our embedding (Section 2.3) and a mean square 
error clustering criterion (Section 2.4), we are able to accurately assign nodes to 
blocks, for all but a negligible number of nodes, for graphs distributed according 
to a stochastic blockmodel. 

2.2 Random Dot Product Graphs 

We present the random dot product graph (RDPG) model to motivate our 
embedding technique (Section 2.3) and provide a second parametrization for 
stochastic blockmodels (Section 2.5). Let $X, Y \in \mathbb{R}^{n \times d}$ be such that $X =
[X_1, X_2, \ldots, X_n]^T$ and $Y = [Y_1, Y_2, \ldots, Y_n]^T$, where $X_u, Y_u \in \mathbb{R}^d$ for all $u \in [n]$.
The matrices $X$ and $Y$ are random and satisfy $\mathbb{P}[\langle X_u, Y_v \rangle \in [0,1]] = 1$ for all
$u, v \in [n]$. Conditioned on $X$ and $Y$, the entries of the adjacency matrix $A$ are
independent and $A_{uv}$ is a Bernoulli random variable with parameter $\langle X_u, Y_v \rangle$
for all $u \neq v \in [n]$. This gives

$$\mathbb{P}[A \mid X, Y] = \prod_{u \neq v} \mathbb{P}[A_{uv} \mid X_u, Y_v] = \prod_{u \neq v} \langle X_u, Y_v \rangle^{A_{uv}} \left(1 - \langle X_u, Y_v \rangle\right)^{1 - A_{uv}}, \qquad (2)$$

where the product is over all ordered pairs of nodes.

2.3 Embedding 

The RDPG model motivates the following embedding. By an embedding of an 
adjacency matrix A we mean 

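$$(\tilde{X}, \tilde{Y}) = \operatorname{argmin}_{(X^\dagger, Y^\dagger) \in \mathbb{R}^{n \times d} \times \mathbb{R}^{n \times d}} \|A - X^\dagger Y^{\dagger T}\|_F \qquad (3)$$

where $d$, the target dimensionality of the embedding, is fixed and known and
$\|\cdot\|_F$ denotes the Frobenius norm. Though $\tilde{X}\tilde{Y}^T$ may be a poor approximation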
of A, Theorems 1 and 12 show that such an embedding provides a represen- 
tation of the nodes which enables clustering of the nodes provided the random 
graph is distributed according to a stochastic blockmodel. In fact, if a graph is 
distributed according to an RDPG model then a solution to Eqn. 3 provides an 
estimate of the latent vectors given by X and Y. We do not explore properties 
of this estimate but instead focus on the stochastic blockmodel. 

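Eckart and Young (1936) provided the following solution to Eqn. 3. Let
$A = \tilde{U}' \tilde{\Sigma}' \tilde{V}'^T$ be the singular value decomposition of $A$, where $\tilde{U}', \tilde{V}' \in \mathbb{R}^{n \times n}$
are orthogonal and $\tilde{\Sigma}' \in \mathbb{R}^{n \times n}$ is diagonal, with diagonals $\sigma_1(A) \geq \sigma_2(A) \geq
\cdots \geq \sigma_n(A) \geq 0$, the singular values of $A$. Let $\tilde{U} \in \mathbb{R}^{n \times d}$ and $\tilde{V} \in \mathbb{R}^{n \times d}$ be the
first $d$ columns of $\tilde{U}'$ and $\tilde{V}'$, respectively, and let $\tilde{\Sigma} \in \mathbb{R}^{d \times d}$ be the diagonal
matrix with diagonals $\sigma_1(A), \ldots, \sigma_d(A)$. Eqn. 3 is solved by $\tilde{X} = \tilde{U}\tilde{\Sigma}^{1/2}$ and
$\tilde{Y} = \tilde{V}\tilde{\Sigma}^{1/2}$.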

We refer to $(\tilde{X}, \tilde{Y})$ as the "scaled adjacency spectral embedding" of $A$. We
refer to $(\tilde{U}, \tilde{V})$ as the "unscaled adjacency spectral embedding" of $A$. The
adjacency spectral embedding is similar to an embedding which is presented
in Marchette et al. (2011). It is also similar to spectral clustering where the
decomposition is on the normalized graph Laplacian.

Theorem 1 uses a clustering of the unscaled adjacency spectral embedding
of A while Corollary 9 extends the result to clustering on the scaled adjacency 
spectral embedding. Though this embedding is proposed for embedding an 
adjacency matrix, we use the same procedure to embed other matrices. 

2.4 Clustering Criterion 

We prove that for a graph distributed according to the stochastic blockmodel, we 
can use the following clustering criterion on the adjacency spectral embedding 
of $A$ to accurately assign nodes to blocks. Let $Z \in \mathbb{R}^{n \times m}$. We use the following
mean square error criterion for clustering the rows of $Z$ into $K$ blocks,

$$(\hat{\psi}, \hat{\tau}) = \operatorname{argmin}_{\psi; \tau} \sum_{u=1}^{n} \|Z_u - \psi_{\tau(u)}\|_2^2, \qquad (4)$$

where $\psi \in \mathbb{R}^{K \times m}$, $\psi_i \in \mathbb{R}^m$ gives the centroid of block $i$ and $\tau : [n] \mapsto [K]$ is
the block assignment function.

Again, note that other computationally less expensive criteria can also be
quite effective. Indeed, in Section 6.1, we achieve misclassification rates which
are empirically better than our theoretical bounds using the K-means clustering
algorithm, which only attempts to solve Eqn. 4. Additionally, other clustering
algorithms may prove useful in practice though presently we do not investigate 
these procedures. 
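
To make the criterion in Eqn. 4 concrete, the sketch below evaluates it for a given assignment and, for very small inputs, finds an exact minimizer by brute force. Both function names are hypothetical. The sketch relies on the standard fact that, for a fixed assignment, the optimal centroid of each block is the mean of its rows; the exhaustive search over $K^n$ assignments is shown only to contrast with K-means, which merely attempts to solve Eqn. 4.

```python
from itertools import product


def criterion(Z, tau, K):
    """Square error of assignment tau (0-based labels) with block means as centroids."""
    total = 0.0
    for k in range(K):
        rows = [z for z, t in zip(Z, tau) if t == k]
        if not rows:
            continue  # an empty block contributes nothing
        m = len(rows[0])
        centroid = [sum(r[j] for r in rows) / len(rows) for j in range(m)]
        total += sum((r[j] - centroid[j]) ** 2 for r in rows for j in range(m))
    return total


def exact_minimizer(Z, K):
    """Brute-force search over all K**n assignments; feasible only for tiny n."""
    return min(product(range(K), repeat=len(Z)), key=lambda tau: criterion(Z, tau, K))
```

For example, with rows `[[0, 0], [0, 1], [10, 10], [10, 11]]` and `K = 2`, the minimizer groups the first two rows together and the last two rows together.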

2.5 Stochastic Blockmodel as RDPG Model 

We present another parametrization of a stochastic blockmodel corresponding to 
the RDPG model. Suppose we have a stochastic blockmodel with rank(P) = d. 
Then there exist $\nu, \mu \in \mathbb{R}^{K \times d}$ such that $P = \nu\mu^T$ and by definition $P_{ij} =
\langle \nu_i, \mu_j \rangle$. Let $\tau : [n] \mapsto [K]$ be the random block membership function.

Let $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^{n \times d}$ have row $u$ given by $X_u^T = \nu_{\tau(u)}^T$ and $Y_u^T =
\mu_{\tau(u)}^T$, respectively, for all $u$. Then we have

$$\mathbb{P}[A_{uv} = 1] = P_{\tau(u),\tau(v)} = \langle \nu_{\tau(u)}, \mu_{\tau(v)} \rangle = \langle X_u, Y_v \rangle. \qquad (5)$$

In this way, the stochastic blockmodel can be parametrized by $\nu, \mu \in \mathbb{R}^{K \times d}$
and $\rho$ provided that $(\nu\mu^T)_{ij} \in [0,1]$ for all $i,j \in [K]$. This viewpoint proves
valuable in the analysis and clustering of the adjacency spectral embedding.

Importantly, the distinctness of rows or columns in $P$ is equivalent to the
distinctness of the rows of $\nu$ or $\mu$. (Indeed note that for $i \neq j$, $P_{i \cdot} - P_{j \cdot} = 0$ if
and only if $(\nu_i - \nu_j)^T \mu^T = 0$, but $\mathrm{rank}(\mu) = d$ so $\nu_i = \nu_j$. Similarly, $P_{\cdot i} = P_{\cdot j}$
if and only if $\mu_i = \mu_j$.) Also note, we can take $(\nu, \mu)$ as the adjacency spectral
embedding of $P$ with target dimensionality $\mathrm{rank}(P)$ to get such a representation
from any given $P$.
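
As a small worked example of this parametrization, consider a hypothetical two-block matrix $P$ (the numbers are chosen here purely for illustration and do not come from the paper). Because this $P$ is symmetric and positive definite, its singular value decomposition coincides with its eigendecomposition, and the adjacency spectral embedding of $P$ gives $\nu = \mu$:

```latex
P = \begin{pmatrix} 0.5 & 0.2 \\ 0.2 & 0.5 \end{pmatrix}
  = U \Sigma U^T, \quad
U = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad
\Sigma = \begin{pmatrix} 0.7 & 0 \\ 0 & 0.3 \end{pmatrix},
\\
\nu = \mu = U \Sigma^{1/2}
  = \frac{1}{\sqrt{2}}
    \begin{pmatrix} \sqrt{0.7} & \sqrt{0.3} \\ \sqrt{0.7} & -\sqrt{0.3} \end{pmatrix},
\\
\langle \nu_1, \mu_1 \rangle = \tfrac{1}{2}(0.7 + 0.3) = 0.5, \qquad
\langle \nu_1, \mu_2 \rangle = \tfrac{1}{2}(0.7 - 0.3) = 0.2.
```

The two rows of $\nu$ are distinct, in line with the distinctness of the rows of $P$.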

3 Main Results 

3.1 Notation 

We use the following notation for the remainder of this paper. Let $P \in [0,1]^{K \times K}$
and $\rho \in (0,1)^K$ be a vector with positive entries summing to unity. Suppose
$\mathrm{rank}(P) = d$. Let $\nu\mu^T = P$ with $\nu, \mu \in \mathbb{R}^{K \times d}$. We now define the following
constants not depending on $n$:

• $\alpha > 0$ such that all eigenvalues of $\nu^T\nu$ and $\mu^T\mu$ are greater than $\alpha$;

• $\beta > 0$ such that $\beta < \|\nu_i - \nu_j\|$ or $\beta < \|\mu_i - \mu_j\|$ for all $i \neq j$;

• $\gamma > 0$ such that $\gamma < \rho_i$ for all $i \in [K]$.

We consider a sequence of random adjacency matrices $A^{(n)}$ with node set
$[n]$ for $n \in \{1, 2, \ldots\}$. The edges are distributed according to a stochastic
blockmodel with parameters $P$ and $\rho$. Let $\tau^{(n)} : [n] \mapsto [K]$ be the random
block membership function, which induces the matrices $X^{(n)}, Y^{(n)} \in \mathbb{R}^{n \times d}$ as
in Section 2.5. Let $n_i = |\{u : \tau(u) = i\}|$ be the size of block $i$.

Let $XY^T = U\Sigma V^T$ be the singular value decomposition, with $U, V \in \mathbb{R}^{n \times d}$
and $\Sigma \in \mathbb{R}^{d \times d}$, so that $(U, V)$ is the unscaled spectral embedding of $XY^T$.
Let $(\tilde{X}, \tilde{Y})$ be the adjacency spectral embedding of $A$ and let $(\tilde{U}, \tilde{V})$ be the
unscaled adjacency spectral embedding of $A$. Finally, let $W \in \mathbb{R}^{n \times 2d}$ be the
concatenation $[U|V]$ and similarly $\tilde{W} = [\tilde{U}|\tilde{V}]$.

3.2 Main Theorem 

The main contribution of this paper is the following consistency result in terms 
of the estimation of the block memberships for each node based on the block 
assignment function $\hat{\tau}$ which assigns blocks based on $\tilde{W}$. In the following, an
event occurs "almost always" if with probability 1 the event occurs for all but
finitely many $n \in \{1, 2, \ldots\}$.

Theorem 1. Under the conditions of Section 3.1, suppose that the number of
blocks $K$ and the latent vector dimension $d$ are known. Let $\hat{\tau}^{(n)} : V \mapsto [K]$
be the block assignment function according to a clustering of the rows of $\tilde{W}^{(n)}$
satisfying Eqn. 4. Let $S_K$ be the set of permutations on $[K]$. It almost always
holds that

$$\min_{\pi \in S_K} |\{u \in V : \tau(u) \neq \pi(\hat{\tau}(u))\}| \leq \frac{2^3 3^2 6}{\alpha^5 \beta^2 \gamma^5} \log n. \qquad (6)$$
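
The quantity bounded in Theorem 1, the number of mis-assigned nodes minimized over relabelings of the blocks, can be computed directly for small $K$. The following is a minimal sketch with a hypothetical function name; it uses 0-based labels and enumerates all $K!$ permutations, so it is practical only when $K$ is small.

```python
from itertools import permutations


def misassigned(tau, tau_hat, K):
    """Minimum over permutations pi of #{u : tau[u] != pi(tau_hat[u])}."""
    best = len(tau)
    for pi in permutations(range(K)):
        errors = sum(1 for t, th in zip(tau, tau_hat) if t != pi[th])
        best = min(best, errors)
    return best
```

Dividing the result by the number of nodes gives the proportion of mis-assigned nodes.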

To prove this theorem, we first provide a bound on the Frobenius norm
of $AA^T - (XY^T)(XY^T)^T$, following Rohe et al. (2011). Using this result
and properties of the stochastic blockmodel, we then find a lower bound for
the smallest non-zero singular value of $XY^T$ and the corresponding singular
value of $A$. This enables us to apply the Davis-Kahan Theorem (Davis and
Kahan, 1970) to show that the unscaled adjacency spectral embedding of $A$
is approximately a rotation of the unscaled adjacency spectral embedding of
$XY^T$.

Finally, we lower bound the distances between the at most $K$ distinct rows of
$U$ and $V$. These gaps, together with the good approximation by the embedding
of $A$, are sufficient to prove consistency of the mean square error clustering of the
embedded vectors. Most results, except the important Proposition 2 and the
main theorem, are proved in the Appendix.

Proposition 2. Let $Q^{(n)} \in [0,1]^{n \times n}$ be a sequence of random matrices and
let $A^{(n)} \in \{0,1\}^{n \times n}$ be a sequence of random adjacency matrices corresponding
to a sequence of random graphs on $n$ nodes for $n \in \{1, 2, \ldots\}$. Suppose the
probability of an edge from node $u$ to node $v$ is given by $Q^{(n)}_{uv}$ and that the
presence of edges is conditionally independent given $Q^{(n)}$. Then the following
holds almost always:

$$\|A^{(n)} A^{(n)T} - Q^{(n)} Q^{(n)T}\|_F \leq \sqrt{3}\, n^{3/2} \sqrt{\log n}. \qquad (7)$$

Proof. For ease of exposition, we drop the index $n$ from $Q^{(n)}$ and $A^{(n)}$. Note that,
conditioned on $Q$, $A_{uw}$ and $A_{vw}$ are independent Bernoulli random variables
for all $w \in [n]$ provided $u \neq v$. For each $w \notin \{u, v\}$, $A_{uw}A_{vw}$ is a conditionally
independent Bernoulli with parameter $Q_{uw}Q_{vw}$. For $u \neq v$ we have

$$(AA^T)_{uv} - (QQ^T)_{uv} = \sum_{w \notin \{u,v\}} \left(A_{uw}A_{vw} - Q_{uw}Q_{vw}\right) - Q_{uu}Q_{vu} - Q_{uv}Q_{vv}. \qquad (8)$$

Thus, by Hoeffding's inequality,

$$\mathbb{P}\left[\left((AA^T)_{uv} - (QQ^T)_{uv}\right)^2 \geq 2(n-2)\log n + 2n + 4 \,\middle|\, Q\right] \leq 2n^{-4}. \qquad (9)$$

We can integrate over all choices of $Q$ so that Eqn. 9 holds unconditionally.

For the diagonal entries, $\left((AA^T)_{uu} - (QQ^T)_{uu}\right)^2 \leq n^2$ always. The diagonal terms
and the $2n + 4$ terms from Eqn. 9 all sum to at most $3n^3 + 4n^2 \leq n^3 \log n$
for $n$ large enough. Combining these inequalities we get the inequality

$$\mathbb{P}\left[\|AA^T - QQ^T\|_F^2 \geq 3n^3 \log n\right] \leq 2n^{-2}. \qquad (10)$$

Applying the Borel-Cantelli Lemma gives the result. □ 
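
The Hoeffding step in the proof above can be spelled out as follows. This is a sketch of the standard argument rather than a quotation from the paper: it treats only the sum over $w \notin \{u, v\}$, whose $n - 2$ conditionally independent, mean-zero summands each lie in an interval of length 1.

```latex
S = \sum_{w \notin \{u,v\}} \left(A_{uw}A_{vw} - Q_{uw}Q_{vw}\right), \qquad
\mathbb{P}\left[\,|S| \geq t \mid Q\,\right] \leq 2\exp\!\left(-\frac{2t^2}{n-2}\right).
% Choosing t^2 = 2(n-2)\log n:
2\exp\!\left(-\frac{2 \cdot 2(n-2)\log n}{n-2}\right) = 2\exp(-4\log n) = 2n^{-4}.
% Union bound over at most n^2 ordered pairs (u, v):
n^2 \cdot 2n^{-4} = 2n^{-2}, \qquad \sum_{n \geq 1} 2n^{-2} < \infty.
```

The summability of $2n^{-2}$ is what allows the Borel-Cantelli Lemma to turn the probability bound into an "almost always" statement.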

Taking $Q = XY^T$ gives the following immediate corollary.

Corollary 3. It almost always holds that

$$\|AA^T - XY^T(XY^T)^T\|_F \leq \sqrt{3}\, n^{3/2} \sqrt{\log n} \qquad (11)$$

and

$$\|A^TA - (XY^T)^T XY^T\|_F \leq \sqrt{3}\, n^{3/2} \sqrt{\log n}. \qquad (12)$$

The next two results provide bounds on the singular values of XY T and 
A based on lower bounds for the eigenvalues of P and the block membership 
probabilities. 

Lemma 4. It almost always holds that $\alpha\gamma n \leq \sigma_d(XY^T)$ and it always holds
that $\sigma_{d+1}(XY^T) = 0$ and $\sigma_1(XY^T) \leq n$.

Corollary 5. It almost always holds that

$$\alpha\gamma n \leq \sigma_d(A) \quad \text{and} \quad \sigma_{d+1}(A) \leq 3^{1/4} n^{3/4} \log^{1/4} n \qquad (13)$$

and it always holds that $\sigma_1(A) \leq n$.

We note that Corollary 5 immediately suggests a consistent estimator of the
rank of $XY^T$ given by $\hat{d} = \max\{d' : \sigma_{d'}(A) > 3^{1/4} n^{3/4} \log^{1/4} n\}$. Presently we
do not investigate the use of this estimator and assume that $d = \mathrm{rank}(P)$ is
known.
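
The rank estimator suggested by Corollary 5 amounts to counting the singular values of $A$ that exceed the threshold $3^{1/4} n^{3/4} \log^{1/4} n$. A minimal sketch follows; the function name is hypothetical, the singular values are assumed to have been computed elsewhere, and the paper itself does not investigate this estimator.

```python
import math


def estimate_rank(singular_values, n):
    """Count singular values above 3^(1/4) * n^(3/4) * log(n)^(1/4)."""
    threshold = (3 * n ** 3 * math.log(n)) ** 0.25
    # The count equals the largest index d' with sigma_{d'} above the threshold,
    # because singular values are ordered from largest to smallest.
    return sum(1 for s in singular_values if s > threshold)
```

For $n = 1000$ the threshold is roughly 379, so singular values of 600 and 450 would be counted while values of 30 and 25 would not.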

The following is the version of the Davis-Kahan Theorem (Davis and Kahan, 
1970) as stated in Rohe et al. (2011). 

Theorem 6 (Davis and Kahan). Let $H, H' \in \mathbb{R}^{n \times n}$ be symmetric, suppose $S \subset
\mathbb{R}$ is an interval, and suppose for some positive integer $d$ that $W, W' \in \mathbb{R}^{n \times d}$
are such that the columns of $W$ form an orthonormal basis for the sum of the
eigenspaces of $H$ associated with the eigenvalues of $H$ in $S$ and that the columns
of $W'$ form an orthonormal basis for the sum of the eigenspaces of $H'$ associated
with the eigenvalues of $H'$ in $S$. Let $\delta$ be the minimum distance between any
eigenvalue of $H$ in $S$ and any eigenvalue of $H$ not in $S$. Then there exists an
orthogonal matrix $R \in \mathbb{R}^{d \times d}$ such that $\|WR - W'\|_F \leq \frac{\sqrt{2}}{\delta} \|H - H'\|_F$.

For completeness, we provide a brief discussion of this important result in 
Appendix B. Applying Theorem 6 and Lemma 4 to $AA^T$ and $XY^T(XY^T)^T$,
we have the following result.

Lemma 7. It almost always holds that there exists an orthogonal matrix $R \in
\mathbb{R}^{2d \times 2d}$ such that $\|WR - \tilde{W}\|_F \leq \sqrt{2}\, \frac{\sqrt{6}}{\alpha^2\gamma^2} \sqrt{\frac{\log n}{n}}$.

Recall that $XY^T = U\Sigma V^T$. We now provide bounds for the gaps between
the at most $K$ distinct rows of $U$ and $V$.

Lemma 8. It almost always holds that, for all $u, v$ such that $X_u \neq X_v$, $\|U_u -
U_v\| \geq \beta\sqrt{\alpha\gamma}\, n^{-1/2}$. Similarly, for all $Y_u \neq Y_v$, $\|V_u - V_v\| \geq \beta\sqrt{\alpha\gamma}\, n^{-1/2}$. As a
result, $\|W_u - W_v\| \geq \beta\sqrt{\alpha\gamma}\, n^{-1/2}$ for all $u, v$ such that $\tau(u) \neq \tau(v)$.




We now have the necessary ingredients to show our main result. 



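Proof of Theorem 1. Let $\hat{\psi}$ and $\hat{\tau}$ satisfy the clustering criterion for $\tilde{W}$ (where
$\tilde{W} = [\tilde{U}|\tilde{V}]$ takes the role of $Z$ in Section 2.4). Let $C \in \mathbb{R}^{n \times 2d}$ have row $u$ given
by $C_u = \hat{\psi}_{\hat{\tau}(u)}$. Then Eqn. 4 gives that $\|C - \tilde{W}\|_F \leq \|WR - \tilde{W}\|_F$ as $W$
has at most $K$ distinct rows. Thus, Lemma 7 gives that

$$\|C - WR\|_F \leq \|C - \tilde{W}\|_F + \|\tilde{W} - WR\|_F \leq 2^{3/2} \frac{\sqrt{6}}{\alpha^2\gamma^2} \sqrt{\frac{\log n}{n}}. \qquad (14)$$

Let $B_1, B_2, \ldots, B_K$ be balls of radius $r = \frac{\beta}{3}\sqrt{\alpha\gamma}\, n^{-1/2}$ each centered around the
$K$ distinct rows of $W$. By Lemma 8, these balls are almost always disjoint.

Now note that almost always the number of rows $u$ such that $\|C_u - W_uR\| >
r$ is at most $\frac{2^3 3^2 6}{\alpha^5 \beta^2 \gamma^5} \log n$. If this were not so then infinitely often we would have

$$\|C - WR\|_F > \sqrt{\frac{2^3 3^2 6}{\alpha^5 \beta^2 \gamma^5} \log n}\; \frac{\beta}{3}\sqrt{\alpha\gamma}\, n^{-1/2} = 2^{3/2} \frac{\sqrt{6}}{\alpha^2\gamma^2} \sqrt{\frac{\log n}{n}} \qquad (15)$$

in contradiction to Eqn. 14. Since $n_i > \gamma n > \frac{2^3 3^2 6}{\alpha^5 \beta^2 \gamma^5} \log n$ almost always, each
ball $B_i$ can contain exactly one of the $K$ distinct rows of $C$. This gives the
number of misclassifications as $\frac{2^3 3^2 6}{\alpha^5 \beta^2 \gamma^5} \log n$ as desired. □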

This gives that a clustering of the concatenation of the matrices U and V 
from the singular value decomposition gives an accurate block assignment. One 
may also cluster the scaled singular vectors given by X and Y without a change 
in the order of the number of misclassifications. 



4 Extensions 

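Corollary 9. Under the conditions of Theorem 1, let $\hat{\tau} : V \mapsto [K]$ be a clustering
of $\tilde{Z} = [\tilde{X}|\tilde{Y}]$. Then it almost always holds that

$$\min_{\pi \in S_K} |\{u \in V : \pi(\hat{\tau}(u)) \neq \tau(u)\}| \leq \frac{2^3 3^2 6}{\alpha^6 \beta^2 \gamma^6} \log n. \qquad (16)$$

The proof relies on the fact that the square roots of the singular values are
all of the same order and differ by a multiplicative factor of at most $\sqrt{\alpha\gamma}$.

We now present consistent estimators of the parameters $P$ and $\rho$ for the
stochastic blockmodel. Consider the following estimates

$$\hat{n}_k = |\{u : \hat{\tau}(u) = k\}|, \qquad \hat{\rho}_k = \frac{\hat{n}_k}{n} \qquad (17)$$

and

$$\hat{P}_{ij} = \begin{cases} \dfrac{1}{\hat{n}_i \hat{n}_j} \displaystyle\sum_{(u,v) \in \hat{\tau}^{-1}(i) \times \hat{\tau}^{-1}(j)} A_{uv}, & \text{if } i \neq j, \\[2ex] \dfrac{1}{\hat{n}_i^2 - \hat{n}_i} \displaystyle\sum_{(u,v) \in \hat{\tau}^{-1}(i) \times \hat{\tau}^{-1}(i)} A_{uv}, & \text{if } i = j. \end{cases} \qquad (18)$$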
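This gives the following corollary.

Corollary 10. Under the conditions of Theorem 1,

$$\min_{\pi \in S_K} |\rho_i - \hat{\rho}_{\pi(i)}| \xrightarrow{\text{a.s.}} 0 \qquad (19)$$

and

$$\min_{\pi \in S_K} |\hat{P}_{\pi(i)\pi(j)} - P_{ij}| \xrightarrow{\text{a.s.}} 0 \qquad (20)$$

for all $i, j \in [K]$ as $n \to \infty$.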

The proof is immediate from Theorem 1 and the law of large numbers. 
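
The plug-in estimates of $\rho$ and $P$ described in this section can be sketched as follows. The function name and list-of-lists representation are hypothetical, labels are 0-based, and the within-block denominator $\hat{n}_i(\hat{n}_i - 1)$, which reflects the absence of loops, is an assumption of this sketch.

```python
def estimate_parameters(A, tau_hat, K):
    """Block proportions and block-to-block edge frequencies from A and tau_hat."""
    n = len(A)
    blocks = [[u for u in range(n) if tau_hat[u] == k] for k in range(K)]
    rho_hat = [len(b) / n for b in blocks]
    P_hat = [[0.0] * K for _ in range(K)]
    for i in range(K):
        for j in range(K):
            ni, nj = len(blocks[i]), len(blocks[j])
            # Number of ordered pairs (u, v) with u != v between the two blocks.
            pairs = ni * nj if i != j else ni * (ni - 1)
            if pairs > 0:
                edges = sum(A[u][v] for u in blocks[i] for v in blocks[j] if u != v)
                P_hat[i][j] = edges / pairs
    return rho_hat, P_hat
```

Because the assignments are identified only up to a permutation of the block labels, the estimates should be compared with the true parameters after matching labels, as in Corollary 10.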

If we take $(\hat{\nu}, \hat{\mu})$ to be the adjacency spectral embedding of $\hat{P}$ then we also
have that $\hat{\nu}$ and $\hat{\mu}$ provide consistent estimates for $(\nu, \mu)$, the adjacency spectral
embedding of $P$, in the following sense.

Corollary 11. Under the conditions of Theorem 1, with probability 1 there
exist sequences of orthogonal matrices $R_1^{(n)}, R_2^{(n)} \in \mathbb{R}^{d \times d}$ such that

$$\|\hat{\nu} - \nu R_1^{(n)}\|_F \to 0 \quad \text{and} \quad \|\hat{\mu} - \mu R_2^{(n)}\|_F \to 0. \qquad (21)$$

The proof relies on applications of the Davis-Kahan Theorem in a similar 
way to Lemma 7. 



5 Undirected Version 

We now present the undirected version of the stochastic blockmodel and state 
the main result. The setting and notation are from Section 3.1. 

For the undirected version of the stochastic blockmodel, the matrix P is 
symmetric and $P_{ij} = P_{ji}$ gives the probability of an edge between a node in
block $i$ and a node in block $j$ for each $i,j \in [K]$. Conditioned on $\tau$, $A_{uv}$ is a
Bernoulli random variable with parameter $P_{\tau(u),\tau(v)}$ for all $u \neq v \in [n]$. As $A$ is
symmetric, not all entries of $A$ are independent, but the entries are independent
provided two entries do not correspond to the same undirected edge.

For the undirected version a re-parametrization of the stochastic blockmodel
as an RDPG model as in Section 2.5 is not always possible. However, we can
find $\nu, \mu \in \mathbb{R}^{K \times d}$ such that $\nu\mu^T = P$ and $\nu$ and $\mu$ have equal columns up
to a possible change in sign in each column. This means the rows of $\nu$ and $\mu$
are distinct so it is not necessary to cluster on the concatenated embeddings.
Instead, we consider clustering the rows of U or X, which gives a factor of two 
improvement in misclassification rate.

Theorem 12. Under the undirected version of the stochastic blockmodel, suppose
that the number of blocks $K$ and the latent feature dimension $d$ are known.
Let $\hat{\tau} : V \mapsto [K]$ be a block assignment function according to a clustering of the
rows of $\tilde{U}$ satisfying the criterion in Eqn. 4. It almost always holds that

$$\min_{\pi \in S_K} |\{u \in V : \tau(u) \neq \pi(\hat{\tau}(u))\}| \leq \frac{2^2 3^2 6}{\alpha^5 \beta^2 \gamma^5} \log n. \qquad (22)$$

Corollary 9 holds when clustering on $\tilde{X}$, with the same factor of 2 improvement
in misclassification rate. Corollaries 10 and 11 also hold without change.

6 Empirical Results

We evaluated this procedure and compared it to the spectral clustering procedure
of Rohe et al. (2011) for both simulated data (§ 6.1) and using a Wikipedia
hyperlink graph (§ 6.2).

6.1 Simulated Data

To illustrate the effectiveness of the adjacency spectral embedding, we simulate
random undirected graphs generated from the following stochastic blockmodel:

[Eqn. 23, giving the parameters $P \in [0,1]^{2 \times 2}$ and $\rho \in (0,1)^2$, is missing from the source.] (23)

For each $n \in \{500, 600, \ldots, 2000\}$, we simulated 100 Monte Carlo replicates from
this model conditioned on the fact that $|\{u \in [n] : \tau(u) = i\}| = \rho_i n$ for each
$i \in \{1, 2\}$. In this model we assume that $d = 2$ and $K = 2$ are known.

We evaluated four different embedding procedures and for each embedding
we used K-means clustering, which attempts to iteratively find the solution
to Eqn. 4, to generate the node assignment function $\hat{\tau}$. The four embedding
procedures are the scaled and unscaled adjacency spectral embedding as well as
the scaled and unscaled Laplacian spectral embedding. The Laplacian spectral
embedding uses the same spectral decomposition but works with the normalized
Laplacian (as defined in Rohe et al. (2011)) rather than the adjacency matrix.
The normalized Laplacian is given by $L = D^{-1/2} A D^{-1/2}$ where $D \in \mathbb{R}^{n \times n}$ is
diagonal with $D_{vv} = \deg(v)$, the degree of node $v$.

We evaluated the performance of the node assignments by computing the
percentage of mis-assigned nodes, $\min_{\pi \in S_2} |\{u \in [n] : \tau(u) \neq \pi(\hat{\tau}(u))\}|/n$, as
in Eqn. 6. Figure 1 demonstrates that performance of K-means on all four
embeddings improves with increasing number of nodes. It also demonstrates
(via a paired Wilcoxon test) that for these model parameters the adjacency
embedding is superior to the Laplacian embeddings for large $n$. In fact, for $n >
1400$ we observed that for each simulated graph the scaled adjacency embedding
always performed better than both Laplacian embeddings. We note that these
model parameters were specifically constructed to demonstrate a case where the
adjacency embedding is superior to the Laplacian embedding.

Figure 1: Mean error for 100 Monte Carlo replicates using K-means on four
different embedding procedures.
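
For reference, the normalized Laplacian $L = D^{-1/2} A D^{-1/2}$ used for the Laplacian embeddings in this section can be computed as in the sketch below. The function name is hypothetical, and the handling of isolated nodes (degree zero) is an assumption made here, since the text does not address that case.

```python
import math


def normalized_laplacian(A):
    """Return L = D^{-1/2} A D^{-1/2} for a symmetric 0/1 adjacency matrix A."""
    n = len(A)
    deg = [sum(row) for row in A]
    # Assumption: an isolated node gets a zero row and column instead of a division by zero.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[inv_sqrt[u] * A[u][v] * inv_sqrt[v] for v in range(n)] for u in range(n)]
```

The Laplacian spectral embedding then applies the same decomposition and truncation to `L` that the adjacency spectral embedding applies to `A`.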

Figure 2 shows an example of the scaled adjacency (left) and scaled Lapla- 
cian (right) spectral embeddings. The graph has 2000 nodes and the points 
are colored according to their block membership. The dashed line shows the 
discriminant boundary given by the K-means algorithm with $K = 2$.

6.2 Wikipedia Graph 

For this data, each node in the graph corresponds to a Wikipedia page and 
the edges correspond to the presence of a hyperlink between two pages (in 
either direction). We consider this as an undirected graph. Every article within 
two hyperlinks of the article "Algebraic Geometry" was included as a node in 
the graph. This resulted in $n = 1382$ nodes. Additionally, each document,
and hence each node, was manually labeled as one of the following: Category, 
Person, Location, Date and Math. 

To illustrate the utility of this algorithm we embedded this graph using the 
scaled adjacency and Laplacian procedures. Figure 3 shows the two embeddings 
for d = 2. The points are colored according to their manually assigned labels. 
First we note that on the whole the two embeddings look moderately different. 
In fact, for the adjacency embedding one can see that the orange points are 
well separated from the remaining data. On the other hand, with the Laplacian 
embedding we can see that the red points are somewhat separated from the 
remaining data. The dashed lines show the resulting boundary as determined by
K-means with $K = 2$.

Figure 2: Scatter plots of the scaled adjacency (left) and Laplacian (right)
embeddings of a 2000 node graph.

To evaluate the performance we considered the 5 different tasks of identifying
one block and grouping the remaining blocks together. For each of the 5 blocks,
we compared each of the one-vs-all block labels to the estimated labels from
K-means, with K = 2, on the two embeddings. Table 1 shows the number of
incorrectly assigned nodes, as in Eqn. 6, as well as the adjusted Rand index
(Hubert and Arabie, 1985). The adjusted Rand index (ARI) has the property
that the optimal value is 1 and a value of zero indicates the expected value if
the labels were assigned randomly.
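
The two quantities reported in Table 1 can be computed from a pair of label vectors as in the sketch below. This is an illustrative implementation, not the code used to produce the table: the function names are hypothetical, both label vectors are assumed to take values in {0, 1}, and at least two nodes are assumed.

```python
from collections import Counter
from math import comb

def misassigned_two_blocks(truth, est):
    """Mis-assigned nodes for 0/1 labels, minimized over the two label matchings."""
    mismatches = sum(t != e for t, e in zip(truth, est))
    return min(mismatches, len(truth) - mismatches)

def adjusted_rand_index(truth, est):
    """Adjusted Rand index computed from the contingency table (needs n >= 2)."""
    n = len(truth)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(truth, est)).values())
    pairs_truth = sum(comb(c, 2) for c in Counter(truth).values())
    pairs_est = sum(comb(c, 2) for c in Counter(est).values())
    expected = pairs_truth * pairs_est / comb(n, 2)
    maximum = (pairs_truth + pairs_est) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both partitions are trivial and identical
    return (pairs_joint - expected) / (maximum - expected)

# One-vs-all labels for a hypothetical list of manual labels.
manual = ["Date", "Date", "Math", "Person", "Math", "Category"]
truth = [int(label == "Date") for label in manual]
est = [0, 0, 1, 1, 1, 0]  # hypothetical K-means output with K = 2
print(misassigned_two_blocks(truth, est), adjusted_rand_index(truth, est))
```

The error count is invariant to swapping the two cluster names, which is why the minimum over both matchings is taken; the ARI has the same invariance built in.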

We can see from this table that K-means on the adjacency embedding identifies
the separation of the Date block from the other four while on the Laplacian
embedding K-means identifies the separation of the Math block from the other
four. This indicates that for this data set (and indeed more generally) the choice
of embedding procedure will depend greatly on the desired exploitation task.

We note that for both embeddings, the clusters generated using K-means,
with K = 5, poorly reflect the manually assigned block memberships. We have
not investigated beyond the illustrative 2-dimensional embeddings. 





     Category (119)  Person (372)    Location (270)  Date (191)      Math (430)
     Error   ARI     Error   ARI     Error   ARI     Error   ARI     Error   ARI
A    242     -0.08   495     -0.07   341      0.01   130      0.47   543      0.06
L    299     -0.02   495     -0.02   476     -0.1    401     -0.10   350      0.19

Table 1: One versus all comparison of each block against the estimated K-means
block assignments with K = 2.

Figure 3: Scatter plots for the Wikipedia graph. The left pane shows the scaled
adjacency embedding and the right pane shows the scaled Laplacian embedding.
Each point is colored according to the manually assigned labels. The dashed
line represents the discriminant boundary determined by K-means with K = 2.

7 Discussion 

Our simulations demonstrate that for a particular example of the stochastic 
blockmodel, the proportion of mis-assigned nodes will rapidly become small. 
Though our bound shows that the number of mis-assigned nodes will not grow 
faster than $O(\log n)$, in some instances this bound may be very loose. We also
demonstrate that using the adjacency embedding over the Laplacian embedding 
can provide performance improvements in some settings. It is also clear from 
Figure 2 that the use of other unsupervised clustering techniques, such as Gaus- 
sian mixture modeling, will likely lead to further performance improvements. 

On the Wikipedia graph, the two-dimensional embedding demonstrates that 
the adjacency embedding procedure provides an alternative to the Laplacian 
embedding and the two may have fundamentally different properties. Both the 
Date block and the Math block have some differentiating structure in the graph 
but these structures are illuminated more in one embedding than the other.
This analysis suggests that further investigations into comparisons between the 
adjacency embedding and the Laplacian embeddings will be fruitful. 

Our empirical analysis indicates that the adjacency spectral embeddings and 
the Laplacian spectral embeddings are strongly related while the two embed- 
dings may emphasize different aspects of the particular graph. Rohe et al. (2011) 
used similar techniques to show consistency of block assignment on the Lapla- 
cian embedding and achieved the same asymptotic rates of misclassification. 
Indeed, if one considers the embedding given by $D^{-1/2}\tilde{Z}$, then this embedding
will be very close to the scaled Laplacian embedding and may provide a link 
between the two procedures. 

Note that consistent block assignments are possible using either the singular
vectors or the scaled version of the singular vectors. The singular vectors
themselves are essentially a whitened version of scaled singular vectors. Since the
singular vectors are orthogonal, the estimated covariance of rows of the scaled 
vectors is proportional to the diagonal matrix given by the singular values of A. 
This suggests that clustering using a criterion invariant to coordinate-scalings 
and rotations will likely have similar asymptotic properties. 
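
The whitening remark can be made concrete with a one-line computation. The notation here is an assumption for illustration: $\tilde{U} \in \mathbb{R}^{n \times d}$ holds the top d left singular vectors of A and $\tilde{\Sigma}$ is the diagonal matrix of the corresponding singular values. The uncentered second-moment matrix of the rows of the scaled vectors is then diagonal:

```latex
\tilde{U}^T \tilde{U} = I_d ,
\qquad
\bigl(\tilde{U}\tilde{\Sigma}^{1/2}\bigr)^T \bigl(\tilde{U}\tilde{\Sigma}^{1/2}\bigr)
  = \tilde{\Sigma}^{1/2}\, \tilde{U}^T \tilde{U}\, \tilde{\Sigma}^{1/2}
  = \tilde{\Sigma} .
```

Dividing by n gives a second-moment estimate proportional to $\tilde{\Sigma}$, whereas the unscaled vectors give a multiple of the identity, which is the sense in which they are whitened. The same computation applies to the right singular vectors.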

Critical to the proof is the bound provided by Proposition 2. Since this
bound does not depend on the method for generating Q, it suggests that
extensions to this theorem are possible. One such extension is to take the number
of blocks $K = K_n$ to go slowly to infinity. For $K_n$ growing, the parameters
$\alpha$, $\beta$, and $\gamma$ are no longer constant in n, so we must impose
conditions on these parameters. If we take d fixed and assume these parameters
go to 0 slowly, it is possible to allow $K_n = n^{\epsilon}$ for $\epsilon > 0$
sufficiently small. Under these conditions, it can be shown that the number of
incorrect block assignments is $o(n\gamma)$, which is negligible relative to the
block sizes. Our proof technique breaks down for $K_n = \Omega(n^{1/4})$ as
Proposition 2 no longer implies a gap in the singular values of A.

In order to avoid the model selection quagmire, we assumed in Theorem 1 
that the number of blocks K and the latent feature dimension d are known. 
However, the proof of this theorem suggests that both K and d can be
estimated consistently. Corollary 5 shows that all but d of the singular values
of A are less than $3^{1/4} n^{3/4} \log^{1/4} n$ for n large enough. As discussed
earlier, this shows that $\hat{d} = \max\{i : \sigma_i(A) > 3^{1/4} n^{3/4} \log^{1/4} n\}$
will be a consistent estimator for d. Though this estimator is consistent, the
required number of
nodes for it to become accurate will depend highly on the sparsity of the graph, 
which controls the magnitude of the largest singular values of A. Furthermore, 
our bounds suggest that the number of nodes required for this estimate to be 
accurate will increase exponentially as the expected graph density decreases. 
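
The thresholding estimator for d can be sketched in a few lines. This is a minimal illustration rather than a recommended procedure: the function name is hypothetical, the singular values of A are assumed to have been computed already, and the example values are invented.

```python
import math

def estimate_dimension(singular_values, n):
    """Count the singular values exceeding 3^(1/4) n^(3/4) log^(1/4) n.

    With singular values sorted in decreasing order this count equals
    max{i : sigma_i > threshold}, and it is 0 when none exceed it.
    """
    threshold = (3 * n**3 * math.log(n)) ** 0.25
    return sum(1 for s in singular_values if s > threshold)

# Hypothetical singular values for a graph on n = 10000 nodes.
print(estimate_dimension([6000.0, 4000.0, 2000.0, 150.0], 10000))
```

Since the largest singular value is at most n while the threshold grows like $n^{3/4}$ up to a logarithmic factor, a sparse graph needs a large n before any singular value clears the threshold, which is the finite-sample concern raised above.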

Estimating K is more complicated, and we do not present a formal method 
to do so. We do note that the proof shows that most of the embedded vectors 
are concentrated around K separated points. An appropriate covering of the 
points by slowly vanishing balls would allow for a consistent estimate of K. 
More work is needed to provide model selection criteria that are practical for
the practitioner.
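
To give a feel for the covering idea, the sketch below greedily covers embedded points with balls of a fixed radius and counts the balls that capture enough points. It is only a heuristic illustration and not a method proposed or analyzed here: the function name, the radius, and the minimum ball size are all hypothetical choices, and no consistency claim is attached to this particular procedure.

```python
import math

def count_clusters_by_covering(points, radius, min_size):
    """Greedy covering: count balls of the given radius holding >= min_size points."""
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining[0]
        inside = [p for p in remaining if math.dist(p, center) <= radius]
        if len(inside) >= min_size:
            count += 1
        # Remove every point covered by this ball before choosing the next center.
        remaining = [p for p in remaining if math.dist(p, center) > radius]
    return count

# Three tight groups plus one stray point (invented coordinates).
points = [(0, 0), (0.01, 0), (0, 0.01), (0.01, 0.01),
          (1, 0), (1.01, 0), (1, 0.01),
          (0, 1), (0.01, 1), (0, 1.01),
          (5, 5)]
print(count_clusters_by_covering(points, radius=0.2, min_size=3))
```

In practice the radius would have to shrink with n at a suitable rate and the minimum size would have to grow, and choosing those rates is exactly the model selection question left open above.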

Note that some practitioners may have estimates or bounds for the parameters
P and $\rho$, derived from some prior study. In this case, provided bounds
on $\alpha$, $\beta$, and $\gamma$ can be determined, the proof can be used to
derive high probability bounds on the number of nodes that have been assigned
to the incorrect
block. This may also enable the practitioner to choose n to optimize some 
misassignment and cost criteria. 

The proofs above would remain valid if the diagonals of the adjacency matrix
are modified provided that each modification is bounded. In fact, modifying
the diagonals may improve the embedding to give lower numbers of
misassignments. Marchette et al. (2011) suggest replacing the diagonal element
$A_{uu}$ with $\deg(u)/(n-1)$ for each node $u \in [n]$. Scheinerman and Tucker
(2010) provided an iterative algorithm to impute the diagonal. An optimal choice
of the diagonal is not known for general stochastic blockmodels.
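
The degree-based diagonal replacement is simple to state in code. The sketch below is illustrative only: the function name is hypothetical, the adjacency matrix is assumed to be a list of rows, and deg(u) is taken to be the row sum excluding the diagonal (the out-degree for a directed graph), which is an assumption about the intended convention.

```python
def augment_diagonal(A):
    """Return a copy of A with A[u][u] replaced by deg(u) / (n - 1)."""
    n = len(A)
    out = [[float(x) for x in row] for row in A]
    for u in range(n):
        degree = sum(A[u][v] for v in range(n) if v != u)
        out[u][u] = degree / (n - 1)
    return out

# Star graph on three nodes: node 0 is adjacent to nodes 1 and 2.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
print(augment_diagonal(A))
```

Each new diagonal entry lies in [0, 1], so the modification is bounded in the sense required for the proofs to remain valid.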
Another practical concern is the possibility of missing data in the observed
graph. One example may be that each edge in the true graph is only observed
with probability p in the observed graph. Our theory will be unaffected by
this type of error since the observed graph is also distributed according to a
stochastic blockmodel with edge probabilities P' = pP. As a result, asymptotic
consistency remains valid. We may also allow p to decrease slowly with n and
still achieve asymptotically negligible misassignments. However, typically the
finite sample performance will suffer if p is small.
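
The claim that thinning preserves the blockmodel follows from a short computation. Here $A'$ denotes the observed adjacency matrix (notation introduced for this illustration), and the observation of each edge is assumed to be independent of the graph and of the other observations:

```latex
\Pr[A'_{uv} = 1]
  = \Pr[\text{edge } uv \text{ is observed}] \cdot \Pr[A_{uv} = 1]
  = p \, P_{\tau(u),\tau(v)}
  = P'_{\tau(u),\tau(v)} .
```

The entries of $A'$ remain independent given the block memberships, so $A'$ is again a stochastic blockmodel graph with the same blocks and with every edge probability scaled by p.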

Overall, the theory and results presented suggest that this embedding
procedure is worthy of further investigation. The problems of estimating K and d,
choosing between scaled and unscaled embedding and between the adjacency
and the Laplacian will all be considered in future work. This work is also being
generalized to more general latent position models. 

Finally, under the stochastic blockmodel, our method will be less
computationally demanding than ones which depend on maximizing a likelihood or
modularity criterion. Fast methods to compute singular value decompositions are
possible, especially for sparse matrices. There are a plethora of methods for ef- 
ficiently clustering points in Euclidean space. Overall, this embedding method 
may be valuable to the practitioner to provide a rapid method to identify blocks 
in networks. 

A Proofs of Technical Lemmas 

In this appendix, we prove the technical results stated in Section 3.2. 

Lemma 4. It almost always holds that $\alpha\gamma n \le \sigma_d(XY^T)$ and it
always holds that $\sigma_{d+1}(XY^T) = 0$ and $\sigma_1(XY^T) \le n$.

Proof. Since $XY^T \in [0,1]^{n \times n}$, the nonnegative matrix
$XY^T(XY^T)^T$ has entries bounded by n. The row sums are bounded by $n^2$
giving that $\sigma_1^2(XY^T) = \lambda_1(XY^T(XY^T)^T) \le n^2$. Since X and Y
are at most rank d, we have $\sigma_{d+1}(XY^T) = 0$.

The nonzero eigenvalues of $XY^T(XY^T)^T = XY^TYX^T$ are the same as the
nonzero eigenvalues of $Y^TYX^TX$. It almost always holds that $n_i > \gamma n$
for all i so that

$$X^TX = \sum_{i=1}^{K} n_i \nu_i \nu_i^T = \gamma n\, \nu^T\nu + \sum_{i=1}^{K} (n_i - \gamma n)\, \nu_i \nu_i^T \tag{24}$$

is the sum of two positive semidefinite matrices, the first of which has eigenvalues
all greater than $\alpha\gamma n$. This gives $\lambda_d(X^TX) \ge \alpha\gamma n$
and similarly $\lambda_d(Y^TY) \ge \alpha\gamma n$. This gives that $Y^TYX^TX$ is
the product of positive definite matrices. We then use a bound on the smallest
eigenvalues of the product of two positive semi-definite matrices, so that
$\lambda_d(Y^TYX^TX) \ge \lambda_d(Y^TY)\lambda_d(X^TX) \ge (\alpha\gamma n)^2$
(Zhang and Zhang, 2006, Corollary 11). This establishes
$\sigma_d^2(XY^T) \ge (\alpha\gamma n)^2$.
□ 
Corollary 5. It almost always holds that $\alpha\gamma n \le \sigma_d(A)$ and
$\sigma_{d+1}(A) \le 3^{1/4} n^{3/4} \log^{1/4} n$ and it always holds that
$\sigma_1(A) \le n$.

Proof. First, by the same arguments as Lemma 4 we have $\sigma_1(A) \le n$. By
Weyl's inequality (Horn and Johnson, 1985, §6.3), we have that

$$|\sigma_i^2(A) - \sigma_i^2(XY^T)| = |\lambda_i(AA^T) - \lambda_i(XY^T(XY^T)^T)| \le \|AA^T - XY^T(XY^T)^T\|_F. \tag{25}$$

Together with Corollary 3 this shows that
$\sigma_{d+1}(A) \le 3^{1/4} n^{3/4} \log^{1/4} n$ almost always. Since
$\gamma < \rho_i$ for each i, Lemma 4 can be strengthened to show that
there exists $\epsilon > 0$, not dependent on n, such that
$(\alpha\gamma + \epsilon)n \le \sigma_d(XY^T)$. Thus, we have that
$(\alpha\gamma + \epsilon)^2 n^2 \le \sigma_d^2(XY^T)$ so that
$(\alpha\gamma)^2 n^2 \le \sigma_d^2(A)$ since
$\sqrt{3}\, n^{3/2} \sqrt{\log n} < \epsilon^2 n^2$ for n large enough. □
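
The last step of the proof can be spelled out as a chain of inequalities. This is a sketch that assumes, as the proof does, that Corollary 3 bounds $\|AA^T - XY^T(XY^T)^T\|_F$ by $\sqrt{3}\, n^{3/2}\sqrt{\log n}$ almost always, and that n is large enough for the final comparison with $\epsilon^2 n^2$:

```latex
\sigma_d^2(A)
  \ge \sigma_d^2(XY^T) - \|AA^T - XY^T(XY^T)^T\|_F
  \ge (\alpha\gamma + \epsilon)^2 n^2 - \sqrt{3}\, n^{3/2} \sqrt{\log n}
  \ge (\alpha\gamma + \epsilon)^2 n^2 - \epsilon^2 n^2
  \ge (\alpha\gamma)^2 n^2 .
```

The final inequality holds because $(\alpha\gamma + \epsilon)^2 - \epsilon^2 = (\alpha\gamma)^2 + 2\alpha\gamma\epsilon \ge (\alpha\gamma)^2$.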

The singular value decomposition of $XY^T$ is given by $U\Sigma V^T$. The next
result provides bounds for the gaps between the at most K distinct rows of U
and V. Recall that for a matrix M, row u is given by $M_u^T$ for all u.

Lemma 8. It almost always holds that, for all u, v such that $X_u \ne X_v$,
$\|U_u - U_v\| \ge \beta\sqrt{\alpha\gamma}\, n^{-1/2}$. Similarly, for all
$Y_u \ne Y_v$, $\|V_u - V_v\| \ge \beta\sqrt{\alpha\gamma}\, n^{-1/2}$. As a
result, $\|W_u - W_v\| \ge \beta\sqrt{\alpha\gamma}\, n^{-1/2}$ for all u, v such
that $\tau(u) \ne \tau(v)$.

Proof. Let $Y^TY = ED^2E^T$ for $E \in \mathbb{R}^{d \times d}$ orthogonal,
$D \in \mathbb{R}^{d \times d}$ diagonal. Define $G = XE$, $G' = GD$, and
$U' = U\Sigma$. Let u, v be such that $X_u \ne X_v$. From Lemma 4 and its proof,
diagonals of D are almost always at least $\sqrt{\alpha\gamma n}$ and the
diagonals of $\Sigma$ are at most n.
Now,

$$G'G'^T = GD^2G^T = XED^2E^TX^T = XY^TYX^T = U\Sigma V^TV\Sigma U^T = U\Sigma^2U^T = U'U'^T. \tag{26}$$

Let $e \in \mathbb{R}^n$ denote the vector with all zeros except 1 in the u-th
coordinate and -1 in the v-th coordinate. By the above we have
$\|G'_u - G'_v\|^2 = e^TG'G'^Te = e^TU'U'^Te = \|U'_u - U'_v\|^2$. Therefore we
obtain that
$\beta \le \|X_u - X_v\| = \|G_u - G_v\| \le \frac{1}{\sqrt{\alpha\gamma n}}\|G'_u - G'_v\| = \frac{1}{\sqrt{\alpha\gamma n}}\|U'_u - U'_v\| \le \frac{n}{\sqrt{\alpha\gamma n}}\|U_u - U_v\|$,
as desired.

A symmetric argument holds for $\|V_u - V_v\|$. For $\|W_u - W_v\|$ note that if
$\tau(u) \ne \tau(v)$ then either $U_u \ne U_v$ or $V_u \ne V_v$. □

Lemma 7. It almost always holds that there exists an orthogonal matrix
$R \in \mathbb{R}^{2d \times 2d}$ such that
$\|WR - \tilde{W}\|_F \le \frac{\sqrt{2}\sqrt{6}}{\alpha^2\gamma^2}\sqrt{\frac{\log n}{n}}$.

Proof. Let $S = (\tfrac{1}{2}\alpha^2\gamma^2n^2, \infty)$. By Lemma 4 and
Corollary 5, it almost always holds that exactly d eigenvalues of $AA^T$ and
$XY^T(XY^T)^T$ are in S. Additionally, Lemma 4 shows that the gap
$\delta \ge \alpha^2\gamma^2n^2$. Together with Corollary 3, we have that

$$\frac{\sqrt{2}}{\delta}\|AA^T - XY^T(XY^T)^T\|_F \le \frac{\sqrt{6}}{\alpha^2\gamma^2}\sqrt{\frac{\log n}{n}}. \tag{27}$$

This shows there exists an orthogonal $R_1 \in \mathbb{R}^{d \times d}$ such that
$\|UR_1 - \tilde{U}\|_F \le \frac{\sqrt{6}}{\alpha^2\gamma^2}\sqrt{\frac{\log n}{n}}$.
Now note that all of the above could be repeated for $A^TA$ and $(XY^T)^TXY^T$,
to find an orthogonal $R_2 \in \mathbb{R}^{d \times d}$ such that
$\|VR_2 - \tilde{V}\|_F \le \frac{\sqrt{6}}{\alpha^2\gamma^2}\sqrt{\frac{\log n}{n}}$.
Taking R as the direct sum of $R_1$ and $R_2$ gives the result. □
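
The arithmetic behind this proof can be written out explicitly. The sketch below assumes the Corollary 3 bound $\|AA^T - XY^T(XY^T)^T\|_F \le \sqrt{3}\, n^{3/2}\sqrt{\log n}$, the gap $\delta \ge \alpha^2\gamma^2 n^2$, and the notation $W = [U \mid V]$, $\tilde{W} = [\tilde{U} \mid \tilde{V}]$ with $R = R_1 \oplus R_2$, where the tilde marks the singular vectors of A (a notational assumption made for this illustration):

```latex
\frac{\sqrt{2}}{\delta}\, \sqrt{3}\, n^{3/2} \sqrt{\log n}
  \le \frac{\sqrt{6}\, n^{3/2} \sqrt{\log n}}{\alpha^2 \gamma^2 n^2}
  = \frac{\sqrt{6}}{\alpha^2 \gamma^2} \sqrt{\frac{\log n}{n}} ,
\qquad
\|WR - \tilde{W}\|_F^2
  = \|UR_1 - \tilde{U}\|_F^2 + \|VR_2 - \tilde{V}\|_F^2
  \le 2 \cdot \frac{6}{\alpha^4 \gamma^4} \cdot \frac{\log n}{n} .
```

Taking square roots gives a bound of $\frac{\sqrt{2}\sqrt{6}}{\alpha^2\gamma^2}\sqrt{\log n / n}$ on $\|WR - \tilde{W}\|_F$.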



B Davis-Kahan Theorem 

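We now state and provide a brief discussion of the Davis-Kahan theorem (Davis
and Kahan, 1970; Rohe et al., 2011). First, we consider some general results
from the theory of Grassmann spaces (Qi et al., 2005). Let $\mathcal{G}_{d,n}$
denote the set of d-dimensional subspaces of $\mathbb{R}^n$. Two important
metrics on $\mathcal{G}_{d,n}$ are the gap metric $d_g$ and the Hausdorff metric
$d_h$ which are defined as follows. For all
$\mathcal{W}, \mathcal{W}' \in \mathcal{G}_{d,n}$,

$$d_g(\mathcal{W},\mathcal{W}') = \sqrt{\sum_{i=1}^{d} \sin^2 \theta_i(\mathcal{W},\mathcal{W}')}, \qquad d_h(\mathcal{W},\mathcal{W}') = \sqrt{\sum_{i=1}^{d} \Bigl(2\sin\frac{\theta_i(\mathcal{W},\mathcal{W}')}{2}\Bigr)^2}, \tag{28}$$

where $\theta_1(\mathcal{W},\mathcal{W}'), \theta_2(\mathcal{W},\mathcal{W}'), \ldots, \theta_d(\mathcal{W},\mathcal{W}')$
denote the principal angles between $\mathcal{W}$ and $\mathcal{W}'$. By simple
trigonometry $d_h(\mathcal{W},\mathcal{W}') \le \sqrt{2} \cdot d_g(\mathcal{W},\mathcal{W}')$.
Suppose $W, W' \in \mathbb{R}^{n \times d}$ have columns which are orthonormal
bases for $\mathcal{W}$ and $\mathcal{W}'$, respectively. It is well known that
$d_h(\mathcal{W},\mathcal{W}') = \min_R \|WR - W'\|_F$ where the
minimum is over all orthogonal matrices $R \in \mathbb{R}^{d \times d}$.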
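
The trigonometric comparison between the two metrics can be checked angle by angle. The sketch below assumes the Hausdorff metric has the form $d_h = \bigl(\sum_i (2\sin(\theta_i/2))^2\bigr)^{1/2}$ and uses the fact that principal angles lie in $[0, \pi/2]$, so that $\cos\theta \ge 0$:

```latex
\bigl(2\sin(\theta/2)\bigr)^2
  = 2(1 - \cos\theta)
  \le 2(1 - \cos\theta)(1 + \cos\theta)
  = 2\sin^2\theta ,
\qquad \theta \in [0, \pi/2] .
```

Summing over the d principal angles gives $d_h^2 \le 2\, d_g^2$, which is the stated inequality $d_h \le \sqrt{2} \cdot d_g$.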

The next theorem states the original form of the theorem from Davis and
Kahan (1970) followed by the version proved in Rohe et al. (2011).

Theorem 6 (Davis and Kahan). Let $H, H' \in \mathbb{R}^{n \times n}$ be
symmetric, suppose $S \subset \mathbb{R}$ is an interval, and suppose for some
positive integer d that $\mathcal{W} \in \mathcal{G}_{d,n}$ is the sum of the
eigenspaces of H associated with the eigenvalues of H in S, and that
$\mathcal{W}' \in \mathcal{G}_{d,n}$ is the sum of the eigenspaces of H'
associated with the eigenvalues of H' in S. If $\delta$ is the minimum distance
between any eigenvalue of H in S and any eigenvalue of H not in S then
$\delta \cdot d_g(\mathcal{W}, \mathcal{W}') \le \|H - H'\|_F$.

Furthermore, suppose $W, W' \in \mathbb{R}^{n \times d}$ are such that the
columns of W form an orthonormal basis for $\mathcal{W}$ and that the columns
of W' form an orthonormal basis for $\mathcal{W}'$. Then there exists an
orthogonal matrix $R \in \mathbb{R}^{d \times d}$ such that
$\|WR - W'\|_F \le \frac{\sqrt{2}}{\delta}\|H - H'\|_F$.

From the preceding analysis we see that the version from Rohe et al. (2011)
follows from the original theorem; indeed, we have for some orthogonal
$R \in \mathbb{R}^{d \times d}$ that
$\|WR - W'\|_F = d_h(\mathcal{W},\mathcal{W}') \le \sqrt{2}\, d_g(\mathcal{W},\mathcal{W}') \le \frac{\sqrt{2}}{\delta}\|H - H'\|_F$.